Problem Set 10

Author
Affiliation

Vesth. Laust.

Department of Economics, Aarhus University - BSS, Fuglesangs Allé 4.

Display Libraries
library(tidyverse)
library(knitr)
library(wooldridge)
library(dplyr)        # data manipulation
library(haven)
library(kableExtra)
library(DT)
library(gtsummary)
library(latex2exp)
library(broom)       # regression summaries
library(stargazer)
library(car)         # hypothesis tests
library(Hmisc)
library(ggeffects)
library(lmtest)
library(sandwich)
library(fixest)
library(ggplot2)     # plotting
library(tidyr)       # reshaping tables
library(tsibble)     # time series structures
library(lubridate)   # date handling
library(fable)       # TSLM, model()
library(fabletools)  # tidy helpers
library(modelsummary)
library(estimatr)
Chapter 16 summary

Summary

Simultaneous equations models are appropriate when firmly grounded in counterfactual reasoning. In particular, each equation in the system should have a ceteris paribus interpretation. Good examples are when separate equations describe different sides of a market or the behavioral relationships of different economic agents. Supply and demand examples are leading cases, but there are many other applications of SEMs in economics and the social sciences. An important feature of SEMs is that, by fully specifying the system, it is clear which variables are assumed to be exogenous and which ones appear in each equation. Given a full system, we are able to determine which equations can be identified (that is, can be estimated). In the important case of a two-equation system, identification of (say) the first equation is easy to state: at least one exogenous variable must be excluded from the first equation that appears with a nonzero coefficient in the second equation. As we know from previous chapters, OLS estimation of an equation that contains an endogenous explanatory variable generally produces biased and inconsistent estimators. Instead, 2SLS can be used to estimate any identified equation in a system. More advanced system methods are available, but they are beyond the scope of our treatment. The distinction between omitted variables and simultaneity in applications is not always sharp. Both problems, not to mention measurement error, can appear in the same equation. A good example is the labor supply of married women. Years of education (educ) appears in both the labor supply and the wage offer functions [see equations (16.19) and (16.20)]. If omitted ability is in the error term of the labor supply function, then wage and education are both endogenous. The important thing is that an equation estimated by 2SLS can stand on its own. SEMs can be applied to time series data as well.
As with OLS estimation, we must be aware of trending, integrated processes in applying 2SLS. Problems such as serial correlation can be handled as in Section 15-7. We also gave an example of how to estimate an SEM using panel data, where the equation is first differenced to remove the unobserved effect. Then, we can estimate the differenced equation by pooled 2SLS, just as in Chapter 15. Alternatively, in some cases, we can use time-demeaning of all variables, including the IVs, and then apply pooled 2SLS; this is identical to putting in dummies for each cross-sectional observation and using 2SLS, where the dummies act as their own instruments. SEM applications with panel data are very powerful, as they allow us to control for unobserved heterogeneity while dealing with simultaneity. They are becoming more and more common and are not especially difficult to estimate.


Exercise 1.

Demand and Supply

Wooldridge Exercise 16.2 (p. 553)

Solution

The first equation relates quantity to price and income. It is the demand equation: per capita income shifts consumers' demand for corn but plays no direct role in production.

The second equation relates quantity to price and rainfall. It is the supply equation: rainfall determines how much corn can be produced in a given year, but it does not affect how much corn consumers want to buy at a given price.


Exercise 2.

SEM identification

Wooldridge Exercise 16.4 (p. 553)

Solution

An SEM (simultaneous equations model) is a system of equations in which several endogenous variables are determined together, so that some of the explanatory variables in each equation are themselves dependent variables in other equations of the system.

In this instance we have that:

\[ (1) \ \ \ \ \ \ \log{(\text{earnings})}=\beta_0+\beta_1\text{alcohol}+\beta_2\text{educ}+u_1 \] \[ (2) \ \ \ \ \ \ \ \text{alcohol}=\gamma_0+\gamma_1\log{(\text{earnings})}+\gamma_2\text{educ}+\gamma_3\log{(\text{price})}+u_2 \]

The key features of this system:

  • Endogenous variables: log(earnings) and alcohol (each is the dependent variable in one equation and an explanatory variable in the other).

  • Exogenous variables: educ and log(price) (given, not determined in the system).

  • Disturbances: u1 and u2 are unobservable error terms.


Identification and Estimation

Question: Which equation is identified, and how would you estimate it?

To check whether an equation is identified, we use the order and rank conditions. The order condition is necessary but not sufficient.

The order condition requires that the number of exogenous variables excluded from an equation be at least as large as the number of endogenous variables in that equation minus one. With equality, the equation is exactly identified.

By this criterion only equation \((1)\) is identified, and it is exactly identified: it excludes one exogenous variable, log(price), and contains two endogenous variables. The rank condition additionally requires \(\gamma_3 \neq 0\). Equation \((2)\) excludes no exogenous variable, so it is not identified. Equation \((1)\) has to be estimated by 2SLS rather than OLS because alcohol is endogenous (it is correlated with \(u_1\)).

In the first stage, alcohol is regressed on all exogenous variables, educ and log(price), to obtain the fitted values \(\widehat{\text{alcohol}}\). In the second stage, log(earnings) is regressed on \(\widehat{\text{alcohol}}\) and educ. This yields consistent estimates; the standard errors should come from a 2SLS routine rather than from the manual second stage.


Conclusion:

Only the first equation is identified, because log(price) appears in the second equation but is excluded from the first.

Therefore, we can use log(price) as an instrumental variable for alcohol to estimate the first equation.

In the first stage, the endogenous explanatory variable is regressed on all the exogenous variables in the system, including those that also appear in the equation being estimated.
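
The two-stage procedure can be sketched on simulated data. This is only an illustration: every parameter value below is made up, lprice stands in for log(price), and the data are generated from the two structural equations rather than taken from any real sample.

```r
# Simulated illustration only: all parameter values are hypothetical.
set.seed(123)
n      <- 5000
educ   <- rnorm(n, mean = 12, sd = 2)
lprice <- rnorm(n)                       # stands in for log(price)
u1     <- rnorm(n)
u2     <- rnorm(n)                       # independent of u1

# Hypothetical structural parameters
b0 <- 1; b1 <- -0.5; b2 <- 0.1           # equation (1)
g0 <- 2; g1 <- 0.5; g2 <- 0.2; g3 <- -2  # equation (2)

# Solve the system for alcohol (its reduced form), then build log(earnings)
alcohol   <- (g0 + g1 * b0 + (g1 * b2 + g2) * educ + g3 * lprice +
                g1 * u1 + u2) / (1 - g1 * b1)
learnings <- b0 + b1 * alcohol + b2 * educ + u1

# OLS on equation (1): inconsistent because alcohol is correlated with u1
b1_ols <- coef(lm(learnings ~ alcohol + educ))[["alcohol"]]

# Manual 2SLS: first stage on all exogenous variables, then second stage
first_stage <- lm(alcohol ~ educ + lprice)
alcohol_hat <- fitted(first_stage)
b1_2sls <- coef(lm(learnings ~ alcohol_hat + educ))[["alcohol_hat"]]

print(c(true = b1, OLS = b1_ols, TSLS = b1_2sls))
```

The manual second stage reproduces the 2SLS point estimate but not its standard errors, which is why a dedicated IV routine is preferable in applied work.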


Exercise 3.

SEM

Wooldridge Exercise 16.5 (p. 553)

Solution


(i)

The dependent variable is the percentage of sexually active students who have at some point contracted a sexually transmitted disease. We expect \(\beta_1<0\) because condom use should lower the share of students who contract such a disease.


(ii)

Students decide about sexual activity and about condom use at the same time, so the two outcomes are determined jointly rather than one after the other.

In particular, the decision to use condoms depends on the perceived risk of infection: where the share of infected individuals is high, students have a stronger incentive to use condoms. That is, infrate influences conuse, while conuse in turn affects infrate.


(iii)

Given the two equations:

\[ (i) \ \ \ \ \ \ y_1=\beta_0+\beta_1 y_2+\beta_2 z_1+\beta_3 z_2+\beta_4 z_3 + u_1 \]

\[ (ii) \ \ \ \ \ \ y_2=\gamma_0+\gamma_1 y_1 + u_2 \]

Substituting \((i)\) into \((ii)\) gives the reduced form for \(y_2\):

\[ y_2=\gamma_0+\gamma_1 (\beta_0+\beta_1 y_2+\beta_2 z_1+\beta_3 z_2+\beta_4 z_3 + u_1) + u_2 \]
\[ \Longrightarrow y_2=(\gamma_0+\gamma_1\beta_0)+\gamma_1\beta_1 y_2 + \gamma_1\beta_2 z_1+\gamma_1\beta_3 z_2+\gamma_1 \beta_4 z_3+(\gamma_1u_1+u_2) \]
\[ \Longrightarrow y_2(1-\gamma_1\beta_1)=(\gamma_0+\gamma_1\beta_0) + \gamma_1\beta_2 z_1+\gamma_1\beta_3 z_2+\gamma_1 \beta_4 z_3+(\gamma_1u_1+u_2) \]
\[ \Longrightarrow y_2=\frac{(\gamma_0+\gamma_1\beta_0)+\gamma_1\beta_2 z_1+\gamma_1 \beta_3 z_2 + \gamma_1 \beta_4 z_3+(\gamma_1 u_1+u_2)}{1-\gamma_1\beta_1}=\pi_0+\pi_1 z_1+\pi_2 z_2+\pi_3 z_3+v_2 \]

This requires \(\gamma_1\beta_1\neq 1\). Here,

\[ \pi_0=\frac{\gamma_0+\gamma_1\beta_0}{1-\gamma_1\beta_1}, \ \ \ \pi_1=\frac{\gamma_1\beta_2}{1-\gamma_1\beta_1}, \ \ \ \pi_2=\frac{\gamma_1\beta_3}{1-\gamma_1\beta_1}, \ \ \ \pi_3=\frac{\gamma_1\beta_4}{1-\gamma_1\beta_1} \]

\[ v_2=\frac{\gamma_1u_1+u_2}{1-\gamma_1\beta_1} \]

We can now find the covariance between \(y_2\) and \(u_1\); a nonzero covariance makes OLS biased in the first equation. Since the \(z_j\) are exogenous, and assuming \(u_1\) and \(u_2\) are uncorrelated:

\[ Cov(y_2,u_1)=Cov(v_2,u_1) \]
\[ =Cov\left(\frac{\gamma_1u_1+u_2}{1-\gamma_1 \beta_1},u_1\right) \]
\[ =Cov\left(\frac{\gamma_1u_1}{1-\gamma_1\beta_1},u_1\right)+Cov\left(\frac{u_2}{1-\gamma_1\beta_1},u_1\right) \]
\[ =\frac{\gamma_1}{1-\gamma_1\beta_1}\cdot Cov(u_1,u_1)+\frac{1}{1-\gamma_1\beta_1}\cdot Cov(u_2,u_1) \]
\[ =\frac{\gamma_1}{1-\gamma_1\beta_1}\cdot V(u_1)+\frac{1}{1-\gamma_1\beta_1}\cdot 0 \]
\[ =\frac{\gamma_1}{1-\gamma_1\beta_1}\cdot \sigma_1^2 \ \ >\ 0 \]

We assumed \(\beta_1<0\) and \(\gamma_1>0\), so the numerator is positive and the denominator \(1-\gamma_1\beta_1\) exceeds one. The covariance is therefore positive, and the OLS estimator of \(\beta_1\) is biased upward (toward zero, or even to a positive value).
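
The covariance formula can be checked numerically. The sketch below uses made-up values for \(\beta_1\), \(\gamma_1\) and \(\sigma_1\), draws \(u_1\) and \(u_2\) independently (the assumption used in the derivation), and compares the sample covariance with \(\gamma_1\sigma_1^2/(1-\gamma_1\beta_1)\).

```r
# Numerical check of Cov(v2, u1) = gamma1 * sigma1^2 / (1 - gamma1 * beta1).
# All parameter values are hypothetical.
set.seed(42)
n      <- 1e5
beta1  <- -0.5
gamma1 <- 0.5
sigma1 <- 1

u1 <- rnorm(n, sd = sigma1)
u2 <- rnorm(n)                                   # drawn independently of u1

# Reduced-form error of y2
v2 <- (gamma1 * u1 + u2) / (1 - gamma1 * beta1)

cov_sample  <- cov(v2, u1)                       # simulated covariance
cov_formula <- gamma1 * sigma1^2 / (1 - gamma1 * beta1)

print(c(sample = cov_sample, formula = cov_formula))
```

With these values the formula gives 0.4, and the simulated covariance should be close to it, confirming the positive sign of the bias.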


(iv)

The variable condis must satisfy relevance and exogeneity. Together these imply that condis affects infrate only through conuse, which allows us to use condis as an instrumental variable for conuse in the first equation.

Exogeneity requires that condis does not appear directly in the structural equation \((i)\) and that it is uncorrelated with the error term \(u_1\).

Relevance requires that condis appears in the second equation \((ii)\) for conuse with a nonzero (presumably positive) coefficient, so that condis and conuse are correlated.

A concern is that infrate might influence condis: if schools know that infrate is high, they may be more likely to join the programme, which would violate exogeneity. If schools were chosen at random, condis would be an ideal instrument.


Exercise 4.

Demand and Supply revisited

Wooldridge Exercise 16.7 (p. 554)

Solution

(i)

It is important to include a time trend to avoid spurious relationships. The popularity of women’s basketball might drift upward as the programme wins national attention, marketing budgets rise, the local population grows, and so on. Leaving the trend out also creates omitted-variable bias: if attendance and some regressors all trend but the trend is omitted, the regression spuriously ascribes the overall movement to one or more included regressors (e.g. price or winning percentage). Finally, the usual t- and F-tests rely on the series being stationary and weakly dependent. A trending dependent variable violates this, and adding a deterministic trend often restores the assumption, provided the series are trend-stationary.


(ii)

No.
Even though quantity supplied is fixed at capacity, the athletic office almost surely sets price in anticipation of demand-side factors that end up in the error term, \(u_t\).

These could include:

  • Expected weather.

  • Local events competing for fans’ time.

  • One-time promotions.

  • Existence of star players not captured by WINPERC_t.

If price reacts to these unobserved demand shifters, \(Cov(\ln{\text{PRICE}}_t, u_t) \neq 0\), and OLS is biased and inconsistent.

The implication: price should be treated as an endogenous regressor, and the equation should be estimated by instrumental variables (IV) or a control-function approach.


(iii)

So here we can propose that,

\(Z_t=\text{SEASPERC}_{t-1}\) as an instrument for \(\ln{\text{PRICE}}_t\).

The instrument must satisfy the relevance and exogeneity conditions. Relevance holds if \(Cov(Z_t,\ln{\text{PRICE}}_t)\neq 0\), that is, if the administration bases this season’s ticket price partly on last season’s winning percentage. Exogeneity holds if \(Cov(Z_t, u_t)=0\): conditional on the regressors already in the model, last season’s wins do not affect this season’s attendance except through price. This requires that there is no serial correlation linking last season’s wins to unobserved taste shocks this year, and that fans care mostly about current performance (captured by WINPERC_t).

If both conditions hold, \(Z_t\) is a valid instrument.


(iv)

Yes, it is reasonable if the men’s and women’s games are either substitutes (most likely) or complements.

If tickets are substitutes (fans attend whichever is cheaper) the predicted sign would be positive which implies that higher men’s price “pushes” fans toward women’s games.

If tickets are complements (e.g., joint packages, double header days …) the sign would be negative. Higher men’s price would discourage the bundle, lowering women’s demand.


(v)

We can address the problem with two strategies.

Strategy 1: First-difference the entire equation.

\[ \Delta \ln{(\text{ATTEND}_t)}=\beta_1\Delta\ln{(\text{PRICE}_t)}+\beta_2\Delta\text{WINPERC}_t+\beta_3\Delta\text{RIVAL}_t+\beta_4\Delta\text{WEEKEND}_t+\beta_5+\Delta u_t \]

Here:

  • First differencing removes a unit root: an I(1) series becomes I(0).

  • The time trend disappears (its difference is the constant \(\beta_5\)).

  • If price is endogenous, IV is still required (e.g., with \(\Delta\text{SEASPERC}_{t-1}\) as the instrument).
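
To see why the trend turns into a constant, assume the levels equation contains a linear trend \(\beta_5 t\). This levels form is inferred here from the differenced equation above and is not quoted from the textbook:

\[ \ln{(\text{ATTEND}_t)}=\beta_0+\beta_1\ln{(\text{PRICE}_t)}+\beta_2\text{WINPERC}_t+\beta_3\text{RIVAL}_t+\beta_4\text{WEEKEND}_t+\beta_5 t+u_t \]

\[ \beta_0-\beta_0=0, \ \ \ \ \ \beta_5 t-\beta_5 (t-1)=\beta_5 \]

Subtracting the equation for \(t-1\) from the equation for \(t\) therefore removes the intercept \(\beta_0\) and leaves \(\beta_5\) as the intercept of the differenced equation.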

Strategy 2: Error-correction / co-integration (ECM).

If \(\ln{\text{ATTEND}}\) and \(\ln{\text{PRICE}}\) are co-integrated, the following must be estimated:

\[ \Delta\ln{(\text{ATTEND}_t)}=\gamma\left[\ln{(\text{ATTEND}_{t-1})}-\alpha_0-\alpha_1\ln{(\text{PRICE}_{t-1})}-\dots\right]+\text{short-run terms}+\varepsilon_t \]


(vi)

When a game sells out, observed attendance equals capacity, not true demand. This is right-censoring (censoring from above) of the dependent variable, which a Tobit-type model can handle:

\[ \text{ATTEND}_t^{obs} = \begin{cases} \text{ATTEND}_t^* & \text{if } \text{ATTEND}_t^* < \text{CAP} \\ \text{CAP} & \text{if } \text{ATTEND}_t^* \geq \text{CAP} \end{cases} \]

OLS (or IV) on the censored values tends to bias the estimated elasticities toward zero because high-demand games are flattened at the ceiling.

Remedies

  • Exclude capacity games - keeps ordinary regression but discards information.

  • Censored (Tobit) maximum-likelihood - treats capacity as a known censoring point.

  • Poisson or Negative-Binomial with “exposure” - useful if counts are directly modelled.

  • Latent-demand IV-Tobit - combine endogeneity correction with censoring.
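
For the censored (Tobit) remedy, the log-likelihood can be sketched as follows. This assumes, beyond what the solution states, a latent equation \(\text{ATTEND}_t^*=x_t\beta+e_t\) with \(e_t\sim N(0,\sigma^2)\); the symbols \(x_t\), \(e_t\) and \(\sigma\) are notation introduced here for illustration, and \(\phi\) and \(\Phi\) are the standard normal density and distribution function.

\[ \log L(\beta,\sigma)=\sum_{t:\ \text{ATTEND}_t^{obs}<\text{CAP}}\log\left[\frac{1}{\sigma}\phi\left(\frac{\text{ATTEND}_t^{obs}-x_t\beta}{\sigma}\right)\right]+\sum_{t:\ \text{ATTEND}_t^{obs}=\text{CAP}}\log\left[1-\Phi\left(\frac{\text{CAP}-x_t\beta}{\sigma}\right)\right] \]

Sold-out games contribute only the probability that latent demand is at least CAP, so the likelihood uses them without treating CAP as the demand itself.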


Exercise 5.

Estimation of SEM

Wooldridge Exercise 16.C1 (p. 555)

Solution

(i)

\(\beta_1\) is the coefficient on cigs in the income equation. It measures how income changes with smoking, holding education and age fixed. The variable cigs is the average number of cigarettes smoked per day (a level, not a log). Because the left-hand side is in logs while cigs is in levels, \(\beta_1\) is a semi-elasticity:

\(\beta_1 \times 100\) is the percentage change in annual income associated with smoking one more cigarette per day, ceteris paribus.

If \(\beta_1<0\), each extra cigarette cuts expected income; if \(\beta_1>0\) (unlikely), the opposite holds.


(ii)

The model for cigs:

\[ \text{cigs}=\gamma_0+\gamma_1\log{(\text{income})}+\gamma_2\text{educ}+\gamma_3\text{age}+\gamma_4\text{age}^2+\gamma_5\log{(\text{cigpric})}+\gamma_6\text{restaurn}+u_2 \]

  • cigpric is the retail price of a pack. Demand theory suggests that higher prices lead to lower consumption, which implies that \(\gamma_5<0\).

  • restaurn equals one if the person lives in a state where smoking in restaurants is restricted. Such policies raise the full cost of smoking because smoking becomes less convenient, which lowers consumption. This suggests that \(\gamma_6<0\).


(iii)

The income equation:

\[ \log{(\text{income})}=\beta_0+\beta_1\text{cigs}+\beta_2\text{educ}+\beta_3\text{age}+\beta_4\text{age}^2+u_1 \] Where we want to treat cigs as endogenous in this equation.

The income equation is identified if we assume that:

  • The exclusion restriction holds - log(cigpric) and restaurn do not appear in the income equation; they affect income only through cigs.

  • The instruments are exogenous - those variables are uncorrelated with \(u_1\) (the unobserved determinants of income).

  • The instruments are relevant - at least one of \(\gamma_5\) and \(\gamma_6\) is nonzero.

If these hold, the two excluded variables give us more valid instruments (two) than endogenous regressors (one), so the structural parameter \(\beta_1\) is identified; in fact the equation is overidentified.

 

(iv)

Estimating the income equation in R by OLS and discussing the parameter \(\beta_1\):

Display Code
load("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/ProblemSets/ProblemSet11/ProblemSet11/smoke.RData")
OLS_model <- feols(lincome ~ cigs + educ + age + I(age^2), se="hetero", data=data)
tidy_model <- tidy(OLS_model)
kbl(tidy_model, booktabs=TRUE, digits=4, format="html") %>% kable_styling(full_width = FALSE)
term estimate std.error statistic p.value
(Intercept) 7.7954 0.2082 37.4424 0.0000
cigs 0.0017 0.0014 1.2108 0.2263
educ 0.0604 0.0075 8.0995 0.0000
age 0.0577 0.0092 6.2692 0.0000
I(age^2) -0.0006 0.0001 -6.4083 0.0000


From the coefficient on cigs, \(\hat{\beta}_1=0.0017\), one extra cigarette per day is associated with about \(0.17\%\) higher income, because lincome is the natural log of income. The p-value is about \(0.23\), so we fail to reject \(H_0: \ \beta_1=0\) at any conventional level.


(v)

We estimate the reduced form for cigs.

Display Code
reduced_form_cigs <- feols(cigs ~ educ + age + I(age^2) + log(cigpric) + restaurn, se="hetero", data=data)
tidy_reduced_form_cigs <- tidy(reduced_form_cigs)
kbl(tidy_reduced_form_cigs, format="html", digits=4, booktabs = TRUE) %>% kable_styling(full_width=FALSE)
term estimate std.error statistic p.value
(Intercept) 1.5801 25.1903 0.0627 0.9500
educ -0.4501 0.1558 -2.8891 0.0040
age 0.8225 0.1356 6.0670 0.0000
I(age^2) -0.0096 0.0014 -6.6674 0.0000
log(cigpric) -0.3513 6.0271 -0.0583 0.9535
restaurn -2.7364 1.0006 -2.7346 0.0064

 

We treat cigs as an endogenous variable. In a simultaneous equations setup we distinguish between structural equations, such as \(\log{(\text{income})}=\beta_0+\beta_1\text{cigs}+\beta_2\text{educ}+\beta_3\text{age}+\beta_4\text{age}^2+u_1\), and reduced forms.

To obtain the reduced form for an endogenous variable, we regress it on every exogenous variable in the whole system. That is \(\text{cigs}=\delta_0+\delta_1\text{educ}+\delta_2\text{age}+\delta_3\text{age}^2+\delta_4\log{(\text{cigpric})}+\delta_5\text{restaurn}+v\).

Interpretation

The coefficient on log(cigpric) is estimated at \(-0.3513\) with a p-value of \(0.9535\), so it is not statistically different from zero. In this sample cigarette prices carry almost no predictive power for how much a person smokes, which makes log(cigpric) a weak instrument. The point estimate is negative, as demand theory predicts, but it is very imprecise.

The coefficient on restaurn is estimated at \(-2.7364\) with a p-value of \(0.0064\), so it is statistically significant at the 1% level. Living in a state with restaurant smoking restrictions is associated with about 2.7 fewer cigarettes per day, other factors fixed.


(vi)

Solving the task in R.

Display Code
IV_feols <- feols(lincome ~ educ + age + I(age^2) | cigs ~ restaurn, data=data, se="hetero")
modelsummary(list("2SLS"=IV_feols, "OLS"=OLS_model),
             title="2SLS and OLS model comparison",
             notes="Models: IV_feols, OLS_model.",
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("cigs" = "Cigarettes / day",
                        "fit_cigs"="Cigarettes / day",
                        "educ"="Years of Education",
                        "age"="Age",
                        "I(age^2)"="Age\u00B2",
                        "(Intercept)"="Constant"
                       ),
             gof_omit="AIC|BIC",
             output="kableExtra",
             align="lll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )

Conducting the 2SLS regression yields the following results seen in Table 1.

Table 1: 2SLS and OLS model comparison
 2SLS OLS
Cigarettes / day -0.041* 0.002
(0.024) (0.001)
Years of Education 0.040*** 0.060***
(0.015) (0.007)
Age 0.093*** 0.058***
(0.022) (0.009)
Age² -0.001*** -0.001***
(0.000) (0.000)
Constant 7.781*** 7.795***
(0.257) (0.208)
Num.Obs. 807 807
R2 -0.494 0.165
R2 Adj. -0.502 0.161
RMSE 0.87 0.65
Std.Errors Heteroskedasticity-robust Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01


In the OLS column, with lincome on the left-hand side and cigs in levels, the coefficient is a semi-elasticity of approximately 0.2% higher annual income for one more cigarette per day. It is small, positive and statistically insignificant. Instrumenting for cigs with the restaurant-restriction dummy restaurn flips the sign: each additional cigarette per day now lowers expected income by about 4.1%. This estimate is statistically significant only at the 10% level. The estimates differ because OLS and 2SLS are subject to different sources of bias. If unobserved factors such as stress tolerance or job characteristics raise both wages and smoking, cigs is positively correlated with the error term and the OLS coefficient is biased upward (it even turns weakly positive). 2SLS addresses this by using only the policy-induced variation in smoking from restaurn, which is assumed to be uncorrelated with those unobservables. Under that assumption the IV estimate has more of a “causal” interpretation, whereas the OLS figure reflects only a conditional correlation.

The IV coefficient is a semi-elasticity and must be interpreted in percentage terms. We have that \(\beta_{1,2SLS}=-0.041\Longrightarrow 1 \ \text{extra cigarette / day} \ \approx 4.1\% \ \text{lower annual income}\). Someone who smokes 10 more cigarettes per day (half a pack) would, on average, earn roughly 34-41% less than an otherwise identical non-smoker, ceteris paribus; the lower figure uses the exact conversion \(e^{-0.41}-1\), the higher one the linear approximation \(10\times 0.041\).
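
The conversion from a log-point change to an exact percentage change, applied to the rounded estimate \(-0.041\) from Table 1, works out as:

\[ 100\cdot\left(e^{-0.041}-1\right)\approx -4.0\% \] \[ 100\cdot\left(e^{10\times(-0.041)}-1\right)=100\cdot\left(e^{-0.41}-1\right)\approx -33.6\% \]

For one cigarette the approximation \(100\cdot\beta_1\) is almost exact, but for a change of ten cigarettes the gap between \(-41\%\) and \(-33.6\%\) is no longer negligible.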

We observe that \(R^2\) is negative in the 2SLS output. This can happen with IV because the residuals are computed with the actual values of cigs rather than the first-stage fitted values, so the residual sum of squares can exceed the total sum of squares. Goodness of fit is not the goal of this analysis; consistent estimation of \(\beta_1\) is.

So, if the instrument is valid, OLS understates (and even reverses) the effect of smoking on earnings. Once endogeneity is addressed via 2SLS, smoking shows a substantial, economically meaningful, negative association with income, although it is imprecisely estimated. This is consistent with health-productivity channels and prior literature.


(vii)

We are asked if cigarette prices, log(cigpric) and restaurant-smoking bans, restaurn, are plausibly exogenous in the income equation.

Exogeneity

The real cigarette price (state tax + wholesale cost) is set largely by state-level excise taxes and national producers. These decisions do not react to an individual’s wage, and households are price-takers, so a single worker cannot influence the retail price. There are nevertheless plausible threats to exogeneity. One is a cost-of-living correlation: richer states often have higher prices via higher taxes or distribution costs, which makes price positively correlated with the unobserved wage component \(u_1\). Another is selective migration: high-earning, health-conscious workers may move to states with high prices and taxes that discourage smoking, which also correlates price with \(u_1\). These threats could be mitigated with state fixed effects or region dummy variables, so that comparisons are made within states, or by controlling for state-level average income or a cost-of-living index in both equations. A Sargan-Hansen overidentification test could be run using both instruments; a failure to reject would support joint exogeneity.

The restaurant smoking restriction is a policy variable chosen by legislation, presumably unrelated to any single individual’s income. Variation is at the state level, so individual shocks in \(u_1\) cannot affect adoption. Policy endogeneity is still possible: richer and more health-oriented states might adopt restrictions earlier, which would show up in \(u_1\). Sorting is a further concern: high-income, health-conscious people could relocate to states with restrictions, which would also correlate restaurn with \(u_1\). If panel data are available, fixed effects that difference out time-invariant state traits could mitigate this. State-level socio-economic controls (education spending, median income…) could be added to absorb the determinants of policy adoption. It would be wise to check the first-stage F statistic and the overidentification test. If price is a weak instrument while the restriction dummy is strong, weak-IV-robust methods should be relied upon.


Exercise 6.

SEM IV estimation

Wooldridge Exercise 16.9 (p. 557)

Solution

(i)

If this is truly a demand function, \(\alpha_1\) should be negative: quantity demanded falls when the price rises. The reasoning follows from the demand equation:

\[ \log{(\text{passen})}=\beta_{10}+\alpha_1\log{(\text{fare})}+\beta_{11}\log{(\text{dist})}+\beta_{12}[\log{(\text{dist})}]^2+u \]

Here log(passen) is the log of the average number of passengers per day on a given route (the quantity demanded), fare is the average one-way ticket price on that route, and dist is the route distance in miles.

Because the model is written in logs, \(\alpha_1\) is literally the price elasticity of demand and can be written as follows.

\[ \alpha_1=\frac{\partial \log{(\text{passen})}}{\partial \log{(\text{fare})}}=\frac{\partial Q / Q}{\partial P / P}=\frac{\% \ \text{change in quantity}}{\% \ \text{change in price}} \]

The core microeconomic logic is the law of demand.

  • The substitution effect: When the price of a commercial flight rises, air travel becomes relatively more expensive than its alternatives such as driving, taking the train etc. Travelers substitute away from the now costlier option.

  • Income (or budget) effect: A higher airfare leaves would-be travelers with less real purchasing power. With a smaller “travel budget”, they reduce consumption of normal goods such as leisure trips and even some business trips (many firms monitor travel costs closely).

For normal goods, both effects work in the same direction. Quantity demanded falls when prices rise, so the derivative \(\frac{\partial Q}{\partial P}\) is negative. This implies that \(\alpha_1<0\).

Elasticity is a way of measuring responsiveness. How much does one variable change when another variable changes? This is what elasticity captures…

Note

Units do not matter. Elasticity is unit-free: it does not depend on the units in which price or quantity are measured, because it is expressed in percentage changes. It is useful to ask two questions: Is the dependent variable in logs? Is the price variable also in logs? If the answer to both is yes, the coefficient of interest is an elasticity.


(ii)

The equation is estimated in R.

Display Code
load("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/ProblemSets/ProblemSet11/ProblemSet11/airfare.RData")
air97 <- subset(data, year==1997)
OLS_model <- feols(lpassen ~ lfare + ldist + ldistsq, data=air97, se="hetero")
modelsummary(list("OLS Model"=OLS_model),
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("(Intercept)" = "Constant",
                        "lfare"="Log(fare)",
                        "ldist"="Log(dist)",
                        "ldistsq"="Log(dist)^2"
                       ),
             gof_omit="AIC|BIC",
             output="kableExtra",
             align="ll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )

Estimating the equation from (i) on the data for 1997 yields the following results.

OLS Model
Constant 13.230***
(2.289)
Log(fare) -0.391***
(0.063)
Log(dist) -1.570**
(0.690)
Log(dist)^2 0.116**
(0.052)
Num.Obs. 1149
R2 0.057
R2 Adj. 0.054
RMSE 0.82
Std.Errors Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01


We observe that the coefficient on log(fare) is estimated to be -0.391 and is highly statistically significant. This is the estimated price elasticity of demand, and it is negative as expected.

\[ \hat{\alpha}_1=-0.391 \] The magnitude of the elasticity suggests inelasticity. That is \(|\varepsilon|<1\). A 1% increase in price results in a 0.39% fall in passenger volume - ceteris paribus.

This can be visualized as follows.

Display Code
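# Scatter of log(passen) against log(fare) with a fitted linear regression line
ggplot(air97, aes(x=lfare, y=lpassen)) +
  geom_point(alpha=0.4) +
  geom_smooth(method="lm", se=FALSE, colour="orange", linewidth=1) +
  labs(title="log(passen) and log(fare) with OLS fit (1997 routes)", x="log(fare)", y="log(passen)") +
  theme_minimal()  # theme is added to the plot, not passed inside labs()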

If demand is truly inelastic, a monopolist could raise fares and increase total revenue.
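
The revenue claim can be made explicit. Writing route revenue as fare times passengers (the symbol \(R\) is introduced here for illustration) and taking the OLS elasticity at face value:

\[ R=\text{fare}\times\text{passen} \ \Longrightarrow \ \log{(R)}=\log{(\text{fare})}+\log{(\text{passen})} \] \[ \frac{\partial \log{(R)}}{\partial \log{(\text{fare})}}=1+\alpha_1\approx 1-0.391=0.609 \]

So a 1% fare increase would raise revenue by about 0.61%, but only if the OLS estimate were the true demand elasticity.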

 

(iii)

To treat concen as exogenous, we must assume that, once price and the other factors already in the model (distance and its square) are controlled for, any remaining differences in passenger demand are unrelated to how concentrated the route is. Mathematically:

\[ \text{Cov}(\text{concen}, \ u)=0 \] where \(u\) is the composite of “all other things” that make one route busier than another.

Note

So, to call concen exogenous in the demand equation it must be believed that after controlling for price and distance - market concentration is indeed determined by supply side aspects that have nothing whatsoever to do with passengers’ unobserved preferences or shocks to demand.


(iv)

The setup is a little different now. Because we are treating concen as exogenous, the supply side collapses to a single reduced-form relationship, which is formally stated as follows.

\[ \log{(\text{fare})}=\gamma_0+\gamma_1\text{concen}+\gamma_2\log{(\text{dist})}+\gamma_3[\log{(\text{dist})}]^2+v \] log(fare) is the average one-way fare on the route (dependent variable). concen is the share of passengers carried by the largest airline on that route (0 - 1). The two distance terms soak up systematic cost differences across short and long routes. The goal is to estimate \(\gamma_1\) and confirm that \(\gamma_1>0\).

The equation is estimated in R.

Display Code
# Reduced form for log(fare): regress on concen and the distance terms
OLS_model2 <- feols(lfare ~ concen + ldist + ldistsq, se="hetero", data=air97)
modelsummary(list("Model with `concen`"=OLS_model2, "OLS-model"=OLS_model),
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("(Intercept)" = "Constant",
                        "lfare"="Log(fare)",
                        "ldist"="Log(dist)",
                        "ldistsq"="Log(dist)^2",
                        "concen"="Concentration"
                       ),
             gof_omit="AIC|BIC",
             output="kableExtra",
             align="lll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )

The regression yields the following comparison.

                Model with `concen`          OLS-model
Constant        6.190***                     13.230***
                (1.000)                      (2.289)
Log(fare)                                    -0.391***
                                             (0.063)
Log(dist)       -0.936***                    -1.570**
                (0.299)                      (0.690)
Log(dist)^2     0.108***                     0.116**
                (0.022)                      (0.052)
Concentration   0.395***
                (0.068)
Num.Obs.        1149                         1149
R2              0.408                        0.057
R2 Adj.         0.406                        0.054
RMSE            0.36                         0.82
Std.Errors      Heteroskedasticity-robust    Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01


The estimated partial effect of concentration is strongly positive and statistically significant: \[ \hat{\gamma}_1=\frac{\partial \log{(\text{fare})}}{\partial \text{concen}}=0.395 \] Raising concen by 0.10 (a 10-percentage-point increase in the largest carrier's market share) raises fares by roughly 4%, holding route length constant:

\[ \Delta\%\text{fare}\approx0.395\times0.10=3.95\% \]
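
The 3.95% figure relies on the usual log approximation. As a quick check, the exact percentage change implied by a log-linear model is \(100\cdot[\exp(\hat{\gamma}_1\Delta\text{concen})-1]\). The sketch below computes both from the rounded coefficient in the table; the variable names are illustrative and not part of the original code.

```r
# Rounded coefficient on concen from the reduced-form table
gamma1_hat   <- 0.395
delta_concen <- 0.10   # a 10-percentage-point increase in concentration

# Log approximation: % change is roughly 100 * gamma1 * delta
approx_pct <- 100 * gamma1_hat * delta_concen

# Exact change implied by the log-linear functional form
exact_pct <- 100 * (exp(gamma1_hat * delta_concen) - 1)

round(c(approximation = approx_pct, exact = exact_pct), 2)
```

For a change this small the two numbers are nearly identical, so the approximation is harmless here.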

Note

As argued earlier under the market-power point, a dominant carrier faces less head-to-head competition and can therefore charge higher prices. The positive and highly significant coefficient (robust t-statistic of about 5.8) confirms that concentration is a relevant instrument for price in the IV/2SLS step that follows.
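
Instrument relevance is often summarised by the first-stage F statistic. With a single instrument, that F statistic is simply the squared t statistic on the instrument. The sketch below uses the rounded estimate and robust standard error from the table, and the threshold of 10 is the common rule of thumb rather than something taken from this exercise.

```r
# Rounded reduced-form estimate and robust standard error for concen
coef_concen <- 0.395
se_concen   <- 0.068

t_stat <- coef_concen / se_concen   # robust t statistic on the instrument
f_stat <- t_stat^2                  # with one instrument, F equals t squared

round(c(t = t_stat, F = f_stat), 1)
```

The resulting F statistic is well above 10, so weak-instrument concerns appear limited in this application.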


(v)

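The IV model is estimated in R with iv_robust from the estimatr package. The instruments (here concen together with the exogenous distance terms) are listed after the | in the formula: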

Display Code
iv_model <- iv_robust(lpassen ~ lfare + ldist + ldistsq | concen + ldist + ldistsq, data=air97)
                Model with `concen`          OLS-model                    IV-model
Constant        6.190***                     13.230***                    18.014***
                (1.000)                      (2.289)                      (3.457)
Log(fare)                                    -0.391***                    -1.174***
                                             (0.063)                      (0.410)
Log(dist)       -0.936***                    -1.570**                     -2.176***
                (0.299)                      (0.690)                      (0.777)
Log(dist)^2     0.108***                     0.116**                      0.187***
                (0.022)                      (0.052)                      (0.065)
Concentration   0.395***
                (0.068)
Num.Obs.        1149                         1149                         1149
R2              0.408                        0.057                        -0.055
R2 Adj.         0.406                        0.054                        -0.058
RMSE            0.36                         0.82                         0.87
Std.Errors      Heteroskedasticity-robust    Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01

The IV estimate of the price elasticity (-1.174) is considerably larger in absolute value than the OLS estimate (-0.391). The fare is unlikely to be exogenous in the demand equation. Routes with many business travelers, for example, may combine high fares with high passenger volumes. This produces a positive correlation between log(fare) and the error term, which biases the OLS coefficient toward zero. Using concen as an instrument removes this endogeneity, provided concen is itself exogenous, so the IV estimate should be closer to the “true” elasticity.
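
The direction of the OLS bias can be illustrated with a small simulation. The sketch below uses purely simulated data (not air97): the true elasticity, the coefficients, and all variable names are assumptions chosen for illustration. Price responds positively to the demand shock, and 2SLS is carried out by hand with two lm() calls.

```r
set.seed(1)
n <- 5000

z <- runif(n)   # instrument: shifts price, unrelated to the demand shock
u <- rnorm(n)   # unobserved demand shock
v <- rnorm(n)   # other price variation

true_elasticity <- -1
p <- 1 + 2 * z + 0.5 * u + 0.3 * v   # price rises with the demand shock
q <- 10 + true_elasticity * p + u    # demand equation

# OLS ignores the positive correlation between p and u
ols_est <- unname(coef(lm(q ~ p))["p"])

# 2SLS by hand: first stage, then regress q on the fitted price
p_hat  <- fitted(lm(p ~ z))
iv_est <- unname(coef(lm(q ~ p_hat))["p_hat"])

round(c(truth = true_elasticity, OLS = ols_est, IV = iv_est), 3)
```

The manual second stage gives the correct point estimate but not valid standard errors, which is one reason to use a dedicated routine such as iv_robust in practice.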


(vii)

From the IV-model we observe the coefficients on distance and distance squared.

\[ \hat{\beta}_1=-2.176, \ \ \ \ \ \ \hat{\beta}_2=0.187 \] These are our two estimates. The signs tell us that the number of passengers initially falls as log distance increases. This trend reverses once distance passes a turning point, which is found by setting the marginal effect equal to zero.

\[ \frac{d\widehat{\log(\text{passen})}}{d\log(\text{dist})}=\hat{\beta}_1+2\hat{\beta}_2\log(\text{dist})=0 \] \[ \Longrightarrow \log(\text{dist})^\ast=-\frac{\hat{\beta}_1}{2\hat{\beta}_2}=\frac{2.176}{0.374}\approx5.82 \] \[ \Longrightarrow \text{dist}^\ast=e^{5.82}\approx336 \ \text{miles} \approx \ 540 \ \text{km} \] This can be visualized as follows.

Display Code
library(ggplot2)

# Enter the IV estimates from part (v)
b0 <- 18.014       # constant
b1 <- -1.174       # log(fare)
b2 <- -2.176       # log(dist)
b3 <-  0.187       # [log(dist)]^2


# Average log(fare)
mean_lfare <- mean(air97$lfare, na.rm = TRUE)

# First define the range over which the curve is drawn:

# Create a grid of distances (100 to 1,500 miles)
miles <- seq(100, 1500, length.out = 300)
log_dist <- log(miles)
log_dist_sq <- log_dist^2

# Compute predicted log(passengers) at the average fare
log_passen_hat <- b0 + b1 * mean_lfare + b2 * log_dist + b3 * log_dist_sq
passen_hat <- exp(log_passen_hat)



# Generate the plot; the vertical line marks the turning point at 336 miles
df_plot <- data.frame(miles, passen_hat)

ggplot(df_plot, aes(x = miles, y = passen_hat)) +
  geom_line(linewidth = 1.2, colour = "orange") +
  geom_vline(xintercept = 336, colour = "black") +
  labs(x = "Route distance (miles)",
       y = "Predicted number of passengers (per day)",
       title = "Demand vs. route distance (IV estimate)") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold"))
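
The vertical line in the plot is hard-coded at 336 miles. As a small check, the turning point can be recomputed directly from the rounded IV coefficients. The variable names below are illustrative, and the miles-to-kilometres factor is the standard 1.609344.

```r
b_ldist   <- -2.176   # IV coefficient on log(dist)
b_ldistsq <-  0.187   # IV coefficient on [log(dist)]^2

# The marginal effect b_ldist + 2 * b_ldistsq * log(dist) is zero here
log_dist_star   <- -b_ldist / (2 * b_ldistsq)
dist_star_miles <- exp(log_dist_star)
dist_star_km    <- dist_star_miles * 1.609344   # miles to kilometres

round(c(log_dist = log_dist_star, miles = dist_star_miles, km = dist_star_km), 2)
```

Passing dist_star_miles to geom_vline() instead of the literal 336 would keep the plot in sync if the estimates change.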


Exercise 7.

Estimation of Engel curves

Wooldridge Exercise 16.C11 (p. 557)

Display Task



Display Solution

Solution

(i)

Display Code
load("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/ProblemSets/ProblemSet11/ProblemSet11/expendshares.RData")
sfood_summary <- summary(data$sfood)
tidy_sfood_summary <- tidy(sfood_summary)
kbl(tidy_sfood_summary, format="html", booktabs=TRUE, digits=4) %>% kable_styling(full_width=FALSE)
boxplot(data$sfood, main = "Food-budget share (sfood)")

Generating a quick summary over the column sfood yields the following results.

minimum q1 median mean q3 maximum
0.0571 0.2817 0.354 0.3565 0.4258 0.789


The minimum (0.057) and maximum (0.789) of sfood show that the share of spending on food is never zero in this sample. This is unsurprising: virtually every household in the sample spends at least some money on food each week, so a zero share would be rare in a cross-section of U.K. families.


(ii)

The following equation is estimated in R.

\[ \text{sfood}_i=\beta_0+\beta_1\text{ltotexpend}_i+\beta_2\text{age}_i+\beta_3\text{kids}_i+u_i \]
Display Code
OLS_model <- feols(sfood ~ ltotexpend + age + kids, data=data, se="hetero")
modelsummary(list("OLS Model"=OLS_model),
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("(Intercept)" = "Constant",
                        "ltotexpend"="Log of total expenditure",
                        "age"="Age",
                        "kids"="Kids"
                       ),
             gof_omit="AIC|BIC",
             output="kableExtra",
             align="ll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )
OLS Model
Constant 0.896***
(0.029)
Log of total expenditure -0.146***
(0.006)
Age 0.002***
(0.000)
Kids 0.034***
(0.005)
Num.Obs. 1519
R2 0.286
R2 Adj. 0.285
RMSE 0.09
Std.Errors Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01


The coefficient on ltotexpend is estimated to be statistically significant at a value of -0.146.

Interpreting the model

  • A one-unit increase in log total expenditure reduces the food share by about 0.146, that is, 14.6 percentage points, holding age and number of children fixed. Since one log unit is a very large change, a more natural reading is that a 10% increase in total expenditure lowers the food share by about 1.4 percentage points.

  • The t-statistic is over 20 in absolute value, so the decline is highly significant. This suggests that richer households devote proportionally less of their budget to food, which is the classical Engel-curve result.
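
The level-log interpretation can be made concrete with a short calculation. The sketch below uses the rounded OLS coefficient and the sample mean of sfood from part (i). The elasticity formula \(1+\beta_1/\text{sfood}\) follows from writing food expenditure as share times total expenditure; evaluating it at the mean share is an illustrative choice, not part of the exercise.

```r
beta_ltotexpend <- -0.146   # OLS coefficient on ltotexpend
mean_sfood      <- 0.3565   # sample mean of sfood from part (i)

# Change in the food share when total expenditure rises by 10%
delta_share <- beta_ltotexpend * log(1.10)

# Implied expenditure elasticity of food at the mean share: 1 + beta / share
elasticity <- 1 + beta_ltotexpend / mean_sfood

round(c(delta_share_pp = 100 * delta_share, elasticity = elasticity), 2)
```

An elasticity below one is the usual definition of a necessity, which matches the Engel-curve reading above.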


(iii)

The first stage (reduced form) regression is as follows.

\[ \text{ltotexpend}_i=\pi_0+\pi_1\text{lincome}_i+\pi_2\text{age}_i+\pi_3\text{kids}_i+v_i \] In R, feols can run both stages in a single call, so the first-stage equation is shown only to clarify the underlying procedure.

Display Code
lincomeIV_model <- feols(sfood ~ age + kids | 0 | ltotexpend ~ lincome, data=data, se="hetero")
Note

Breaking down the code.

The first part of the formula, sfood ~ age + kids, lists the dependent variable and the exogenous regressors. Together with the instrumented regressor, it corresponds to the structural equation:

\[ \text{sfood}_i=\beta_0+\beta_1\text{ltotexpend}_i+\beta_2\text{age}_i+\beta_3\text{kids}_i+u_i \] The last part, ltotexpend ~ lincome, tells feols to instrument ltotexpend using lincome.

The 0 between the two | separators states that no fixed effects are included. It is just a placeholder for where fixed effects would go if we had any.

The table is generated with the following code.

modelsummary(list("OLS Model"=OLS_model, "IV Model"=lincomeIV_model),
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("(Intercept)" = "Constant",
                        "ltotexpend"="Log of total expenditure",
                        "age"="Age",
                        "kids"="Kids",
                        "fit_ltotexpend"="Log of total expenditure"
                       ),
             gof_omit="AIC|BIC",
             output="kableExtra",
             align="lll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )

Conducting the 2SLS analysis yields the following comparison.

OLS Model  IV Model
Constant 0.896*** 0.952***
(0.029) (0.054)
Log of total expenditure -0.146*** -0.160***
(0.006) (0.013)
Age 0.002*** 0.002***
(0.000) (0.000)
Kids 0.034*** 0.035***
(0.005) (0.005)
Num.Obs. 1519 1519
R2 0.286 0.284
R2 Adj. 0.285 0.282
RMSE 0.09 0.09
Std.Errors Heteroskedasticity-robust Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01


The 2SLS yields a slightly more negative estimate of the coefficient on ltotexpend (log of total expenditure). The IV standard error is about double the OLS standard error (0.013 versus 0.006), as expected when a regressor is instrumented. Correcting for possible endogeneity of total expenditure does not overturn the Engel-curve pattern. If anything, it reinforces the conclusion that higher total spending lowers the food share.


Note

As an exercise some of the data is visualized below.

Display Code
data$pred_ols <- predict(OLS_model)
data$pred_iv <- predict(lincomeIV_model)
library(ggplot2)

ggplot(data, aes(x=ltotexpend, y=sfood)) +
  geom_point(alpha=0.3, color="gray") +
  geom_line(aes(y=pred_ols), color="blue", linewidth=1.2, linetype="dashed") +
  geom_line(aes(y=pred_iv), color="red", linewidth=1.2) +
  labs(title="Engel Curve: OLS & 2SLS",
       subtitle = "Dashed blue = OLS | Red = 2SLS (IV)",
       x="Log Total Expenditure (ltotexpend)",
       y="Share of Spending on Food (sfood)"
       ) +
  theme_minimal()

library(ggplot2)
library(dplyr)

## 1. Representative household values --------------------------------------
avg_age  <- mean(data$age,  na.rm = TRUE)
avg_kids <- round(mean(data$kids, na.rm = TRUE))
avg_inc  <- mean(data$lincome, na.rm = TRUE)

## 2. Prediction grid ------------------------------------------------------
grid <- data.frame(
  ltotexpend = seq(min(data$ltotexpend),
                   max(data$ltotexpend),
                   length.out = 150),
  age     = avg_age,
  kids    = avg_kids,
  lincome = avg_inc
)

## 3. OLS predictions + SEs  (works) ---------------------------------------
pred_ols <- predict(OLS_model, newdata = grid, se.fit = TRUE)

grid$ols_fit <- as.numeric(pred_ols$fit)
grid$ols_se  <- as.numeric(pred_ols$se.fit)

## 4. 2SLS predictions  (fit only) -----------------------------------------
grid$iv_fit <- as.numeric(
  predict(lincomeIV_model, newdata = grid, se.fit = FALSE)
)
## (no iv_se because fixest cannot return it yet)

## 5. 95 % CI for the OLS curve -------------------------------------------
crit <- 1.96
grid <- grid %>%
  mutate(
    ols_lo = ols_fit - crit * ols_se,
    ols_hi = ols_fit + crit * ols_se
  )

## 6. Plot -----------------------------------------------------------------
ggplot() +
  geom_point(data = data,
             aes(x = ltotexpend, y = sfood),
             alpha = .25, colour = "grey60") +

  # OLS ribbon + dashed line
  geom_ribbon(data = grid,
              aes(x = ltotexpend, ymin = ols_lo, ymax = ols_hi),
              fill = "blue", alpha = .15) +
  geom_line(data = grid,
            aes(x = ltotexpend, y = ols_fit, colour = "OLS"),
            linetype = "dashed", linewidth = 1.2) +

  # IV solid line
  geom_line(data = grid,
            aes(x = ltotexpend, y = iv_fit, colour = "2SLS"),
            linewidth = 1.2) +

  scale_colour_manual(
    name   = "Model",
    values = c("OLS" = "blue", "2SLS" = "red"),
    breaks = c("OLS", "2SLS"),   # fix the legend order so the labels below match
    labels = c("OLS (dashed)", "2SLS (IV)")
  ) +
  labs(title = "Engel Curve: OLS vs 2SLS",
       subtitle = "Shaded band = 95 % CI for OLS (robust)",
       x = "Log Total Expenditure (ltotexpend)",
       y = "Share of Spending on Food (sfood)") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "top")
ggplot(data, aes(x = factor(kids), y = sfood)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Food Share by Number of Children",
       x = "Number of Kids", y = "Food Share (sfood)") +
  theme_minimal()
ggplot(data, aes(x = age, y = sfood)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = TRUE, color = "darkgreen") +
  labs(title = "Food Share vs Age of Household Head",
       x = "Age", y = "Food Share (sfood)") +
  theme_minimal()

Category Description
Graph mechanics Predicted food-share values from OLS (dashed blue) and 2SLS (solid red) are plotted for each household. Lines connect predictions in row order. Because each household has different values for age and kids, the lines jump vertically, creating a comb-like appearance.
Economic intuition Despite the noise, a downward trend is visible: food shares tend to decline as log-expenditure increases, consistent with Engel’s law. The 2SLS predictions rest on a slightly steeper slope than OLS (-0.160 versus -0.146), which would mean that OLS understates the Engel-curve slope in absolute value if total expenditure is endogenous.
Analytical relevance Demonstrates the consequences of plotting raw predicted values without holding covariates constant. Highlights the influence of demographic heterogeneity on fitted values.
Use and purpose Primarily diagnostic. Helps assess heterogeneity in fitted values and supports the methodological shift toward using a smoothed conditional prediction grid.


Category Description
Graph mechanics A prediction grid is created where ltotexpend varies, but age, kids, and lincome are held constant. OLS predictions use the built-in se.fit option; the IV predictions are point predictions only, because no standard errors are returned for them. A 95 % confidence band is drawn around the OLS line.
Economic intuition Both Engel curves slope downward, showing that food shares decline with total expenditure. The IV curve is slightly steeper than the OLS curve (slope -0.160 versus -0.146), which is the direction expected if measurement error in total expenditure attenuates the OLS slope.
Analytical relevance Translates regression output into a visual form. Makes it possible to assess both the estimated trend and the associated uncertainty across the expenditure distribution.
Use and purpose Suitable for reports and presentations. Makes results more interpretable and facilitates comparison between OLS and IV. Shows at a glance whether the IV curve stays inside the OLS confidence band.


Category Description
Graph mechanics Box-and-whisker plot groups sfood by kids. The box shows the interquartile range, the middle line is the median, and outliers are plotted as individual dots.
Economic intuition Food share slightly increases with the number of children. Larger families may spend a greater proportion of their total budget on food.
Analytical relevance Justifies inclusion of kids as a control variable in the Engel-curve model. Indicates demographic variation in budget allocation.
Use and purpose Useful in exploratory data analysis. Helps identify whether categorical or continuous specification of kids is appropriate in regression. Flags outliers that could affect estimation.


Category Description
Graph mechanics LOESS smoothing is applied to sfood across age, creating a flexible trend line with 95 % confidence bands. Shows the local average of food share conditional on age.
Economic intuition The relationship is nearly flat in mid-life, with slight declines in the 30s and mild increases in the 50s. Suggests that very young and older household heads may devote a higher food share.
Analytical relevance Supports the inclusion of age in the Engel-curve regression. The approximate linearity of the trend justifies using a linear term rather than a more complex polynomial.
Use and purpose Helps detect non-linearities. Clarifies how age affects food share and verifies that the linear model does not omit key structure. Appropriate for exploratory analysis and presentation.


(iv)

The analysis is conducted in R.

Display Code
# Get tidy output with confidence intervals directly from the models
tidy_ols <- tidy(OLS_model, conf.int = TRUE)
tidy_iv  <- tidy(lincomeIV_model, conf.int = TRUE)

# Filter just the row for ltotexpend (or adjust if term name differs)
ols_row <- tidy_ols[tidy_ols$term == "ltotexpend", ]
iv_row  <- tidy_iv[tidy_iv$term == "fit_ltotexpend", ]  # or "ltotexpend" if needed

# Combine into a comparison table
comparison <- data.frame(
  Model     = c("OLS", "2SLS"),
  Estimate  = c(ols_row$estimate, iv_row$estimate),
  Std_Error = c(ols_row$std.error, iv_row$std.error),
  CI_Lower  = c(ols_row$conf.low, iv_row$conf.low),
  CI_Upper  = c(ols_row$conf.high, iv_row$conf.high)
)

# Format and display with kbl
kbl(comparison, digits = 4, caption = "Comparison of OLS and 2SLS Estimates for ltotexpend") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
modelsummary(list("OLS Model"=OLS_model, "IV Model"=lincomeIV_model),
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("(Intercept)" = "Constant",
                        "ltotexpend"="Log of total expenditure",
                        "age"="Age",
                        "kids"="Kids",
                        "fit_ltotexpend"="Log of total expenditure"
                       ),
             gof_omit="AIC|BIC",
             conf.int=TRUE,
             conf_level = 0.95,
             output="kableExtra",
             align="lll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )
Comparison of OLS and 2SLS Estimates for ltotexpend
Model Estimate Std_Error CI_Lower CI_Upper
OLS -0.1459 0.0062 -0.1581 -0.1337
2SLS -0.1600 0.0129 -0.1853 -0.1347
OLS Model  IV Model
Constant 0.896*** 0.952***
(0.029) (0.054)
Log of total expenditure -0.146*** -0.160***
(0.006) (0.013)
Age 0.002*** 0.002***
(0.000) (0.000)
Kids 0.034*** 0.035***
(0.005) (0.005)
Num.Obs. 1519 1519
R2 0.286 0.284
R2 Adj. 0.285 0.282
RMSE 0.09 0.09
Std.Errors Heteroskedasticity-robust Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01

Point estimate comparison:

  • The IV estimate (−0.1600) is more negative than the OLS estimate (−0.1459).

  • This suggests that OLS may understate the magnitude of the effect of ltotexpend on the food budget share, although the gap (0.014) is only about one IV standard error.

  • If the discrepancy is real, it likely reflects endogeneity bias in OLS, for example measurement error in total expenditure or omitted variables that correlate with both spending and food share.

Confidence interval comparison:

  • The 95% CI from 2SLS is wider than that from OLS.

  • This is expected: IV estimation involves more sampling uncertainty because it relies on an instrument (lincome) to isolate variation in ltotexpend.

  • Both intervals are fairly tight, and they overlap substantially: the OLS interval lies almost entirely inside the 2SLS interval. The difference between the OLS and IV estimates is therefore not statistically meaningful.
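
The interval comparison can be reproduced from the reported estimates and standard errors. The sketch below rebuilds the 95% intervals with the normal critical value 1.96 and measures how much of the OLS interval lies inside the 2SLS interval. The final contrast is a rough Hausman-type statistic that assumes homoskedasticity, so it is indicative only and is not a replacement for the regression-based test in part (v).

```r
est <- c(OLS = -0.1459, IV = -0.1600)
se  <- c(OLS =  0.0062, IV =  0.0129)

# Normal-approximation 95% confidence intervals
ci <- cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)
round(ci, 4)

# Share of the OLS interval that lies inside the IV interval
overlap       <- min(ci[, "upper"]) - max(ci[, "lower"])
overlap_share <- overlap / (ci["OLS", "upper"] - ci["OLS", "lower"])

# Rough Hausman-type contrast (assumes homoskedasticity; indicative only)
t_contrast <- unname((est["IV"] - est["OLS"]) / sqrt(se["IV"]^2 - se["OLS"]^2))

round(c(overlap_share = overlap_share, t_contrast = t_contrast), 2)
```

Almost the whole OLS interval sits inside the 2SLS interval, and the contrast is well below conventional critical values in absolute terms.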


(v)

Testing for endogeneity.

Display Code
# First stage
fs_model <- feols(ltotexpend ~ lincome + age + kids, data=data)

#Extracting residuals
data$vhat <- resid(fs_model)

#Adding the residuals to the original equation.
endog_test <- feols(sfood ~ ltotexpend + age + kids + vhat, data=data, se="hetero")
modelsummary(list("OLS Model"=OLS_model, "IV Model"=lincomeIV_model, "Endogeneity Test Model"=endog_test),
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("(Intercept)" = "Constant",
                        "ltotexpend"="Log of total expenditure",
                        "age"="Age",
                        "kids"="Kids",
                        "fit_ltotexpend"="Log of total expenditure",
                        "vhat"="First-stage residual (vhat)"
                       ),
             gof_omit="AIC|BIC",
             conf.int=TRUE,
             conf_level = 0.95,
             output="kableExtra",
             align="llll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )
OLS Model  IV Model Endogeneity Test Model
Constant 0.896*** 0.952*** 0.952***
(0.029) (0.054) (0.054)
Log of total expenditure -0.146*** -0.160*** -0.160***
(0.006) (0.013) (0.013)
Age 0.002*** 0.002*** 0.002***
(0.000) (0.000) (0.000)
Kids 0.034*** 0.035*** 0.035***
(0.005) (0.005) (0.005)
First-stage residual (vhat) 0.018
(0.016)
Num.Obs. 1519 1519 1519
R2 0.286 0.284 0.287
R2 Adj. 0.285 0.282 0.285
RMSE 0.09 0.09 0.09
Std.Errors Heteroskedasticity-robust Heteroskedasticity-robust Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01


The hypothesis:

\[ H_0: \delta_1=0 \ \ \ \ \ \ \Longrightarrow \text{`ltotexpend` is exogenous} \] \[ H_1: \delta_1 \neq 0 \ \ \ \ \ \ \Longrightarrow \text{`ltotexpend` is endogenous} \] Here \(\delta_1\) is the coefficient on the first-stage residual vhat in the augmented regression. In the table, the coefficient on vhat (0.018, standard error 0.016) is not statistically significant, so we fail to reject the null hypothesis. There is thus no strong statistical evidence that ltotexpend is endogenous.

Comparing with the OLS-model.

  • The coefficient on ltotexpend is more negative in the endogeneity-test model than in the OLS model and is identical to the IV estimate, as it must be: adding the first-stage residual reproduces the 2SLS point estimate. The point estimate is therefore stronger once possible endogeneity is accounted for, although the difference from OLS is not statistically significant.

  • The direction of the change is what IV would deliver if OLS suffered from attenuation bias, which arises when the regressor is measured with error.
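
The regression-based endogeneity test can be sketched on simulated data to show why the IV column and the endogeneity-test column report the same coefficient on ltotexpend. Everything below is an illustrative assumption (simulated variables x, z, w and made-up coefficients), not the expendshares data. Here x is endogenous by construction, so the residual term should be clearly significant, unlike in the application above.

```r
set.seed(42)
n <- 2000

z <- rnorm(n)   # instrument
w <- rnorm(n)   # exogenous control
e <- rnorm(n)   # shock that makes x endogenous
x <- 1 + 0.8 * z + 0.3 * w + e
y <- 2 - 0.5 * x + 0.2 * w + 0.5 * e + rnorm(n)

first_stage <- lm(x ~ z + w)
x_hat <- fitted(first_stage)
v_hat <- resid(first_stage)

tsls_fit <- lm(y ~ x_hat + w)       # 2SLS point estimates by hand
cf_fit   <- lm(y ~ x + w + v_hat)   # augmented regression used for the test

b_tsls <- unname(coef(tsls_fit)["x_hat"])
b_cf   <- unname(coef(cf_fit)["x"])
t_vhat <- summary(cf_fit)$coefficients["v_hat", "t value"]

c(tsls = b_tsls, augmented = b_cf, t_on_vhat = t_vhat)
```

The two slope estimates coincide up to numerical precision, and the t statistic on v_hat is the test statistic for exogeneity of x.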

Overidentifying restrictions.

There are no overidentifying restrictions to test because:

  • Only one instrument lincome is used.

  • Only one endogenous regressor ltotexpend is suspected.

An over-identification test (like the Hansen J-test) requires more instruments than endogenous variables, which is not the case here.


(vi)

\[ \text{sfood}_i=\beta_0+\beta_1\text{ltotexpend}_i+\beta_2\text{age}_i+\beta_3\text{kids}_i+u_i \] \[ \Longrightarrow \text{salcohol}_i=\beta_0+\beta_1\text{ltotexpend}_i+\beta_2\text{age}_i+\beta_3\text{kids}_i+u_i \]

Display Code
alcohol_ols <- feols(salcohol ~ ltotexpend + age + kids, data=data, se="hetero")
alcohol_iv <- feols(salcohol ~  age + kids |0| ltotexpend ~ lincome, data=data, se="hetero")
modelsummary(list("OLS Model (salcohol)"=alcohol_ols, "IV Model (salcohol)"=alcohol_iv),
             statistic="({std.error})", 
             stars=c('*'=.1, '**'=.05, '***'=.01), 
             coef_map=c("(Intercept)" = "Constant",
                        "ltotexpend"="Log of total expenditure",
                        "age"="Age of household head",
                        "kids"="Number of children: 1 or 2",
                        "fit_ltotexpend"="Log of total expenditure",
                        "vhat"="FS Extracted Coefficients",
                        "salcohol"="Share of alcohol expenditures"
                       ),
             gof_omit="AIC|BIC",
             conf.int=TRUE,
             conf_level = 0.95,
             output="kableExtra",
             align="lll",
             escape=FALSE,
             add_header_above=c("  "=1),
             kable_styling=list(full_width=FALSE, bootstrap_options=c("striped", "condensed")),
             )
OLS Model (salcohol)  IV Model (salcohol)
Constant 0.009 -0.000
(0.019) (0.040)
Log of total expenditure 0.028*** 0.030***
(0.004) (0.010)
Age of household head -0.001*** -0.001***
(0.000) (0.000)
Number of children: 1 or 2 -0.013*** -0.013***
(0.003) (0.003)
Num.Obs. 1519 1519
R2 0.055 0.055
R2 Adj. 0.053 0.053
RMSE 0.06 0.06
Std.Errors Heteroskedasticity-robust Heteroskedasticity-robust
* p < 0.1, ** p < 0.05, *** p < 0.01


Interpretation:

  • The coefficient on ltotexpend is positive and highly significant in both models.

  • This means that as log total expenditure increases, the budget share spent on alcohol rises.

  • Alcohol behaves as a luxury good: wealthier households tend to spend a larger share of their budget on it.
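
As with the food equation, the size of the effect is easier to read for a 10% increase in total expenditure than for a full log unit. A rough worked calculation using the rounded OLS coefficient from the table is:

\[ \Delta\text{salcohol}\approx0.028\times\log(1.10)\approx0.028\times0.0953\approx0.0027 \]

That is, a 10% increase in total expenditure is associated with an increase in the alcohol share of roughly 0.27 percentage points, holding age and kids fixed. Using the IV coefficient of 0.030 instead gives about 0.29 percentage points.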